Skip to content

Conversation

@colesbury
Copy link
Contributor

@colesbury colesbury commented Jan 21, 2026

Description

The creation of the iterator class needs to be synchronized.

Fixes #5970

Suggested changelog entry:

  • Fix race condition in py::make_key_iterator with free threaded Python

colesbury and others added 2 commits January 21, 2026 16:46
The creation of the iterator class needs to be synchronized.
struct internals {
#ifdef Py_GIL_DISABLED
pymutex mutex;
pyrecursive_mutex mutex;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@henryiii @rwgk - I could use some advice here. I probably should have used some sort of recursive mutex here from the start -- it's pretty difficult to do the locking for make_iterator_impl without it.

I think changing this would require bumping PYBIND11_INTERNALS_VERSION, at least for Py_GIL_DISABLED builds. Is that acceptable?

Alternatively, maybe we should use something like Py_BEGIN_CRITICAL_SECTION_MUTEX, which supports recursion and eliminates some potential lock ordering deadlocks if you call into Python. The downside is that it is 3.14+ only.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think changing this would require bumping PYBIND11_INTERNALS_VERSION, at least for Py_GIL_DISABLED builds. Is that acceptable?

After the v3.0.2 release, yes. We already have three other PRs that need an internals version bump.

I was planning to release today (PR #5969), but there was a show stopper. We're waiting for a fix for the segfault.

My thinking: the race isn't new, therefore it would seem reasonable to defer fixing it until after the v3.0.2 release, when we have a window of opportunity to bump the internals version.

Caveat: I don't have enough background to decide between the internals version bump and the critical section alternative.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Py_BEGIN_CRITICAL_SECTION_MUTEX sounds fine, I'd be okay to deprecate and remove Python 3.13t support. Maybe we could just fix this bug on 3.14t, and then drop 3.13t in 3.1?

We are thinking about dropping 3.13t in cibuildwheel too.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fine to make the lock recursive in 3.1, we have several ABI bumps coming up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great, I think dropping 3.13t makes sense. I switched to PyCriticalSection_BeginMutex.

I saw there's also a py::scoped_critical_section. Not sure if you'd want to combine them.

@ikrommyd
Copy link

Thanks for working on this so quickly! I can try it with the full library that I discovered the issue with once it is in a ready-to-review state 😃

@colesbury colesbury marked this pull request as ready for review January 22, 2026 17:48
@colesbury
Copy link
Contributor Author

@ikrommyd - if you can run test with the full library, that would be great.

@ikrommyd
Copy link

ikrommyd commented Jan 23, 2026

Looks good from my side! (if I checkout pybind11 master branch, tests/test_core.py fails with the original error reported in the issue TypeError: Object of type 'iterator' is not an instance of 'iterator')
image
image

@henryiii
Copy link
Collaborator

@rwgk what should we do with 3.13t here? Okay to drop it in a patch release, or should we gate this for 3.14t+ only (fine to leave the bug unfixed on 3.13t), and drop in the next minor release instead? CPython 3.13t was always experimental, while 3.14t is not.

@rwgk
Copy link
Collaborator

rwgk commented Jan 24, 2026

Okay to drop it in a patch release

Move vote: yes, drop

I don't think 3.13t is worth the trouble doing anything special.

void unlock() { PyMutex_Unlock(&mutex); }
};

class pycritical_section {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to be careful about copy/move here?

... give me a moment to look at this with the help of some LLM ...

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Result: be01d1e

(created with Cursor Composer 1 model)

rwgk added 3 commits January 24, 2026 10:23
The pycritical_section class is a RAII wrapper that manages a Python
critical section lifecycle:
- Acquires the critical section in the constructor via
  PyCriticalSection_BeginMutex
- Releases it in the destructor via PyCriticalSection_End
- Holds a reference to a pymutex

Allowing copy or move operations would be dangerous:

1. Copy: Both the original and copied objects would call
   PyCriticalSection_End on the same PyCriticalSection object in their
   destructors, leading to double-unlock and undefined behavior.

2. Move: The moved-from object's destructor would still run and attempt
   to end the critical section, while the moved-to object would also try
   to end it, again causing double-unlock.

This follows the same pattern used by other RAII lock guards in the
codebase, such as gil_scoped_acquire and gil_scoped_release, which also
explicitly delete copy/move operations to prevent similar issues.

By explicitly deleting these operations, we prevent accidental misuse
and ensure the critical section is properly managed by a single RAII
object throughout its lifetime.
Python 3.13t was experimental, while Python 3.14t is not. This PR
uses PyCriticalSection_BeginMutex which is only available in Python
3.14+, making Python 3.13t incompatible with the changes.

Removed all Python 3.13t CI jobs:
- ubuntu-latest, 3.13t (standard-large matrix)
- macos-15-intel, 3.13t (standard-large matrix)
- windows-latest, 3.13t (standard-large matrix)
- manylinux job testing 3.13t

This aligns with the decision to drop Python 3.13t support as
discussed in PR pybind#5971.
After removing Python 3.13t support (incompatible with PyCriticalSection_BeginMutex
which requires Python 3.14+), we're adding replacement jobs using Python 3.13
(default) to maintain test coverage in key dimensions:

1. ubuntu-latest, Python 3.13: C++20 + DISABLE_HANDLE_TYPE_NAME_DEFAULT_IMPLEMENTATION
   - Replaces: ubuntu-latest, 3.13t with same config
   - Maintains coverage for this specific configuration combination

2. macos-15-intel, Python 3.13: C++11
   - Replaces: macos-15-intel, 3.13t with same config
   - Maintains macOS coverage for Python 3.13

3. manylinux (musllinux), Python 3.13: GIL testing
   - Replaces: manylinux, 3.13t job
   - Maintains manylinux/musllinux container testing coverage

These additions are proposed to get feedback on which jobs should be kept
to maintain appropriate test coverage without the experimental 3.13t builds.
@rwgk
Copy link
Collaborator

rwgk commented Jan 24, 2026

@rwgk what should we do with 3.13t here? Okay to drop it in a patch release, or should we gate this for 3.14t+ only (fine to leave the bug unfixed on 3.13t), and drop in the next minor release instead? CPython 3.13t was always experimental, while 3.14t is not.

While I was at it, I went ahead and added two commits:

I'm not sure about the second commit: @henryiii which of these jobs would you add?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG]: make_key_iterator type creation does not appear to be thread-safe

4 participants